NUCLEIC ACID-LEVEL ANALYSIS OF PROTEIN STRUCTURE

Background of the Invention

The invention relates to methods of evaluating, altering, and designing protein structures.

Summary of the Invention

Methods of the invention incorporate considerations of mRNA sequence and structure and codon-anticodon energetics into the analysis and design of protein structure. Many prior art methods for analyzing or designing protein structure have relied in part or in whole on analysis at the amino acid level. Proteins, however, are the product of a process which involves a number of cellular entities and their interactions. The interaction of mRNA molecules with the protein translation machinery (e.g., ribosomes and tRNAs, as well as other elements of the cellular environment, such as water and salt molecules) and mRNA intrachain interactions place physico-chemical restraints on the overall process.



Not only are proteins the product of a process, but the process itself has evolved over time. Some constraints, e.g., those imposed by the interaction of mRNAs with environmental elements or with primitive ribosomal structures, or those of mRNA structure and energetics, e.g., the propensity to form secondary structure, may have been more important, or at least different, primordially than they are currently. While not wishing to be bound by theory, the inventors postulate that evidence of those prior constraints may be seen in the sequence of current messages.

Methods of the invention provide for the analysis and design of protein structures on the basis of [illegible] features of the nucleic acid message, [illegible] codon usage patterns [illegible]. Methods of the invention are based on dividing the genetic code [illegible], which specifies amino acids (and stops), into classes, sometimes referred to herein as subcodes or coding modalities, and evaluating a nucleic acid sequence which encodes a protein structure based on its class (i.e., subcode or coding modality). Relevant subcodes or coding modalities can be defined using choice parameters which are a function of message-level properties, wherein each property is related to the composition or structure of the nucleic acid, and is other than the identity of the amino acid (or stop) encoded and other than codon bias. Examples of structural choice parameters, which can serve as methods or rules for assignment of codons into classes, include the nature of the substituents on the coding bases [illegible], size of the coding bases [illegible], [illegible] energetics of the coding bases in overlapping base pairs, and the like. Examples of compositional choice parameters include frequencies of subclasses of codons within more than one of the three alternative reading frames in which a nucleic acid message can be read. Alternative subcodes or coding modalities are not necessarily entirely disjointed, discrete, or unique, and identical subcodes or coding modalities can be obtained using structural and/or compositional parameters.

Methods of the invention allow the identification, analysis, modification and design of protein structures on the basis of patterns or features revealed by the nucleic acid, e.g., the messenger nucleic acid. For example, the identification of a "run" of amino acid residues of a class can be indicative of an evolutionarily conserved region. The identification of a "minority" class codon in a run of majority class codons can be indicative of a structure- or function-critical residue. The discovery of a critical residue can be used in the design or modification of a protein, e.g., to develop a second generation protein. For example, in situations where it is desirable to alter structure or activity of a protein, it may be desirable to alter a critical residue(s) (or a residue which interacts with a critical residue, e.g., an adjacent residue or a residue elsewhere in the protein (or in another protein) with which it interacts). In the case where a change which does not result in significant alterations in structure or activity is desired, residues other than the identified critical residue (or other than residues which interact with it) are changed.

Methods of the invention provide for nearest neighbor frequencies calculated based upon the frequency of pattern of selected classes of codons, i.e., the codon class of the amino acid, and thus provide a higher degree of relevance for analysis of single-class-rich protein structures. Conventional tables of nearest neighbor amino acids do not take into account the classes described herein, and as such provide only "average" values across multiple classes of codons. Also, unlike tables of the invention, conventional nearest-neighbor tables do not take into account the fact that consistent secondary/tertiary structures of proteins can be shown [illegible] with a) "out of frame" properties of protein messages, b) "interframe" properties of protein messages, i.e., correlations between properties of messages read in frame 1, properties of messages read in frame 2, and/or properties of messages read in frame 3, as defined below.

In general, the invention features a method of evaluating protein structure. The method includes:

providing a nucleic acid sequence which encodes the protein structure;
assorting bases of the nucleic acid sequence into subject triplets; and
assigning one or a plurality of subject triplets to one of a plurality of classes, wherein the assignment is a function of classifying triplets of the nucleic acid sequence as members of a class of a binary choice alphabet of n degrees of freedom, and wherein the classes can be generated by applying n binary choice parameters to a triplet to yield at least 2^n classes of subject triplets, wherein a binary choice parameter is a function of a message-level property of the nucleic acid sequence,

thereby evaluating the protein structure. Triplets can be assigned to a class based on whether they satisfy a value for the message-level property, e.g., a triplet can be assigned to a class based on whether its value for a parameter is above or below a predefined value, e.g., the enthalpy [illegible] of a codon-anticodon duplex, or whether or not it possesses a particular characteristic [illegible]. The message-level property is other than the identity of the amino acid or punctuation which a triplet encodes and is other than codon bias.

The class constant table provides a measure of the frequency with which a first and a second amino acid occur as nearest neighbors, wherein nearest neighbor frequencies are determined within a codon class, and wherein a class is a function of a message-level property of a nucleic acid, e.g., the codon, which encodes an amino acid. The class can be any class generated by the binary choice parameter-based methods referred to herein. For example, if the classes are a first class, e.g., high enthalpy codons, and a second class, e.g., low enthalpy codons, the table is generated for nearest neighbors where both neighbors are encoded by codons of either the first class or codons of the second class.
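
As an illustration only, a class-constant nearest neighbor table of the kind described above could be tabulated as in the following Python sketch. The function names, the use of purine-rich versus pyrimidine-rich as the binary choice parameter, the "at least two of three bases" reading of "rich", and the normalization of counts to within-class frequencies are all assumptions made for the example; none of them is taken from the specification.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code in UCAG order; "*" marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def codon_class(codon):
    # Assumed binary choice parameter: purine-rich (R) vs pyrimidine-rich (Y).
    return "R" if sum(b in "AG" for b in codon) >= 2 else "Y"

def class_constant_table(codons):
    """Frequency of each adjacent residue pair, counted only when both
    codons of the pair fall in the same class."""
    counts = Counter()
    for first, second in zip(codons, codons[1:]):
        cls = codon_class(first)
        if cls == codon_class(second):
            counts[(cls, CODON_TABLE[first], CODON_TABLE[second])] += 1
    totals = Counter()
    for (cls, _, _), n in counts.items():
        totals[cls] += n
    # Normalize within each class (an assumed convention).
    return {key: n / totals[key[0]] for key, n in counts.items()}

table = class_constant_table(["AAA", "GAA", "AAG", "UUU", "CUU"])
```

In this toy input the pair AAG/UUU is skipped because its two codons fall in different classes, so only same-class neighbors contribute to the table.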

In another aspect, the invention features a method of evaluating a protein structure. The method includes:

providing a class-constant table of nearest neighbor relationships for amino acid residues;
providing a nucleic acid which encodes a protein structure; and
comparing one or a plurality of the observed nearest neighbor pairs in the protein structure with the frequencies provided by the class constant table,

thereby evaluating the protein structure.

In preferred embodiments, the comparison can include: assigning an expected frequency from the class constant table to one or a plurality of the observed nearest neighbor pairs and determining how many of the observed nearest neighbor pairs fall above or below a predetermined value; determining the likelihood of occurrence, as predicted by the class constant table, for an observed nearest neighbor pair; or determining if an observed nearest neighbor pair of a first and a second amino acid residue from the protein structure is predicted by the class constant table to occur at a predetermined frequency.
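
One way to picture the first comparison is the short Python sketch below, which counts how many observed pairs have a table frequency at or above a cutoff. The table contents, the key layout (class, first residue, second residue), the cutoff, and the treatment of pairs absent from the table as having frequency 0.0 are hypothetical choices made for illustration.

```python
def count_above_below(observed_pairs, table, cutoff):
    """Return (n_at_or_above, n_below) for observed pairs against a cutoff."""
    above = sum(1 for pair in observed_pairs if table.get(pair, 0.0) >= cutoff)
    return above, len(observed_pairs) - above

# Hypothetical table: (class, first residue, second residue) -> frequency.
example_table = {("R", "K", "E"): 0.30, ("R", "E", "K"): 0.05}
observed = [("R", "K", "E"), ("R", "E", "K"), ("R", "G", "G")]
result = count_above_below(observed, example_table, cutoff=0.10)
```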

In another aspect, the invention features a method of evaluating a protein structure for resistance to change, e.g., evolutionary or mutational change. The method includes:

identifying regions of a protein which are encoded by runs of a single subcode, thereby identifying regions which have been resistant to change and which are therefore predicted to be functionally or structurally significant. E.g., the method can include determining if the nucleic acid sequence which encodes the protein structure includes a run of triplets, e.g., a run at least 20, 40, 60, or 120 triplets in length, in which at least 20, 40, 60, 80, 90 or 95%, or all, of the triplets in the run are from one class. Any of the ways of generating classes described herein can be used in this method.
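
The run criterion can be sketched in Python as a sliding window over a list of class labels. Treating a "run" as a fixed-length window, and the names `find_runs`, `window`, and `min_fraction`, are assumptions of this example; the specification gives the length and percentage thresholds but does not prescribe a scanning procedure.

```python
from collections import Counter

def find_runs(classes, window=20, min_fraction=0.9):
    """Return (start_index, class_label) for every window in which one
    class accounts for at least min_fraction of the triplets."""
    hits = []
    for start in range(len(classes) - window + 1):
        label, count = Counter(classes[start:start + window]).most_common(1)[0]
        if count / window >= min_fraction:
            hits.append((start, label))
    return hits

# Toy input: five R-class triplets followed by three Y-class triplets.
labels = ["R"] * 5 + ["Y"] * 3
runs = find_runs(labels, window=4, min_fraction=1.0)
```

Overlapping hits could be merged into maximal regions in a later step; that refinement is omitted here.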

In another aspect, the invention includes a method of evaluating a protein structure for the presence of critical amino acid residues. The method includes:

identifying critical amino acid residues by identifying "minority codons" in runs encoded by codons of a single class or subcode, thereby identifying residues which have been resistant to change and which are therefore believed to be functionally important. Any of the ways of generating classes described herein can be used in this method.
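
A minimal Python sketch of locating minority codons within an already identified run follows. It assumes the run is given as a half-open index range over a list of class labels and that the majority class is simply the most frequent label in that range; both conventions, and the function name, are illustrative rather than taken from the specification.

```python
from collections import Counter

def minority_positions(classes, start, end):
    """Return indices in classes[start:end] whose class differs from the
    majority class of that run."""
    run = classes[start:end]
    majority, _ = Counter(run).most_common(1)[0]
    return [start + i for i, label in enumerate(run) if label != majority]

# Toy run: eight R-class codons with a single Y-class codon at index 4.
labels = list("RRRRYRRRR")
positions = minority_positions(labels, 0, len(labels))
```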

In another aspect, the invention features a method for evaluating a protein structure. The method includes:

providing a nucleic acid sequence which encodes the protein structure;
assorting bases of the nucleic acid sequence into subject triplets; and
assigning at least one of the subject triplets to one of a plurality of classes, wherein the assignment is a function of classifying the subject triplets of the nucleic acid sequence under a binary choice alphabet of n degrees of freedom by applying n binary choice parameters to a triplet to yield at least 2^n classes of subject triplets, wherein the assignment provides at least four classes of triplets, the at least four classes of triplets being represented in at least a portion of the nucleic acid sequence in a ratio of about 3:5:3:5;

thereby evaluating the protein structure.

In another aspect, the invention features a method for identifying coding regions of a nucleic acid sequence, the method comprising:

providing the nucleic acid sequence;
assorting bases of at least a portion of the nucleic acid sequence into a plurality of subject triplets;
assigning the plurality of subject triplets to one of a plurality of classes, wherein the assignment is a function of classifying the subject triplets of the nucleic acid sequence under a binary choice alphabet of n degrees of freedom by applying n binary choice parameters to a triplet to yield at least 2^n classes of subject triplets, wherein the assignment provides at least four classes of triplets A, B, C, and D;
determining whether the plurality of subject triplets are distributed into the at least four classes of triplets A:B:C:D in a ratio of about 3:5:3:5;

thereby identifying coding regions of the nucleic acid sequence.
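
The ratio test in the determining step might be sketched as follows in Python. The specification does not say how close "about 3:5:3:5" must be, so the absolute tolerance of 0.05 on each class proportion is purely an assumption, as are the function and parameter names.

```python
def matches_ratio(counts, expected=(3, 5, 3, 5), tolerance=0.05):
    """Check whether class counts (ordered A, B, C, D) have proportions
    within `tolerance` of the expected ratio."""
    total = sum(counts)
    if total == 0:
        return False
    weight = sum(expected)
    return all(abs(c / total - e / weight) <= tolerance
               for c, e in zip(counts, expected))

close = matches_ratio([30, 50, 30, 50])    # proportions equal 3:5:3:5 exactly
uniform = matches_ratio([40, 40, 40, 40])  # 0.25 each, off by 0.0625
```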

In another aspect, the invention features a method for identifying a protein that includes a polypeptide portion which is structurally or functionally similar to all or a portion of a test protein, the method comprising:

providing a nucleic acid sequence which encodes all or a portion of the test protein;
assorting bases of at least a portion of the nucleic acid sequence into a plurality of subject triplets in a first reading frame;
assigning the plurality of subject triplets in the first reading frame to one of a plurality of classes, wherein the assignment is a function of classifying the subject triplets of the nucleic acid sequence under a first binary choice alphabet of n degrees of freedom by applying n first binary choice parameters to a triplet to yield at least 2^n classes of subject triplets, wherein the assignment provides at least four classes of triplets distributed in a ratio of about 3:5:3:5;
assorting bases of the at least a portion of the nucleic acid sequence into a plurality of subject triplets in a second reading frame;
assigning the plurality of subject triplets in the second reading frame to one of a plurality of classes, wherein the assignment is a function of classifying the subject triplets of the nucleic acid sequence under a second binary choice alphabet of n degrees of freedom by applying n second binary choice parameters to a triplet to yield at least 2^n classes of subject triplets, wherein the assignment provides at least four classes of triplets distributed in a ratio of about 3:5:3:5; and
identifying a protein which includes a polypeptide portion encoded by the plurality of triplets in the second reading frame;

thereby identifying a protein that includes a polypeptide portion which is structurally or functionally similar to all or a portion of the test protein.

In another aspect, the invention features a method for identifying a mutation-prone region of a nucleic acid sequence, e.g., a viral nucleic acid sequence. The method includes:

providing the nucleic acid sequence;
assorting bases of at least a portion of the nucleic acid sequence into a plurality of subject triplets in a first reading frame;
assigning the plurality of subject triplets in the first reading frame to one of a plurality of classes, wherein the assignment is a function of classifying the subject triplets of the nucleic acid sequence under a binary choice alphabet of n degrees of freedom by applying n binary choice parameters to a triplet to yield at least 2^n classes of subject triplets, wherein the assignment provides at least four classes of triplets distributed in a ratio of about 3:5:3:5;
assorting bases of the at least a portion of the nucleic acid sequence into a plurality of subject triplets in a second reading frame; and
assigning the plurality of subject triplets in the second reading frame to one of a plurality of classes, wherein the assignment is a function of classifying the subject triplets of the nucleic acid sequence under a binary choice alphabet of n degrees of freedom by applying n binary choice parameters to a triplet to yield at least 2^n classes of subject triplets, wherein the assignment provides at least four classes of triplets distributed in a ratio of about 3:5:3:5;

thereby identifying a mutation-prone region of the nucleic acid sequence.

In another aspect, the invention includes a method of providing a protein structure, e.g., the structure of a protein of known function, in which one or a plurality of amino acid residues are changed. The method includes:

providing a nucleic acid sequence which encodes a candidate protein structure;
evaluating the sequence by a method described herein; and
altering one or a plurality of amino acid residues in the candidate protein structure,

thereby providing a protein structure.

In yet another aspect, the invention features a machine-readable data storage medium, including a data storage material encoded with machine readable data which, when used with a machine programmed with instructions for using the data, is capable of storing, retrieving, or displaying databases, binary choice alphabets, protein sequences, or nucleic acid sequences of the invention. The storage medium can be used in methods of the invention. In preferred embodiments the storage medium is recorded with: a class constant nearest neighbor table; the classes into which the triplets of a nucleic acid are assigned; or a nucleic acid sequence which encodes a protein structure, or a protein structure, which is to be analyzed or which has been altered by application of a method described herein.

Methods referred to herein can further include creating a record of one or more protein structures to be analyzed or modified, e.g., proteins, protein portions or fragments, or nucleic acids which encode all or part of such protein structure. The protein or nucleic acid structure which is to be analyzed or modified, or the structure which has been identified, evaluated or modified, or both, can be recorded. The record can be encoded in the form of a machine-readable data storage medium. The recorded structure, e.g., a nucleic acid or amino acid sequence, can be displayed on a machine, e.g., on a monitor, or in printed form.

Methods referred to herein can further include providing an identified or modified substance, e.g., a protein or nucleic acid, e.g., chemically synthesizing the identified substance based on the structure identified by way of the methods described herein. In preferred embodiments, the method includes assessing the biological activity of the identified substance. The biological activity of the identified substance can be assessed in vitro or in vivo. In preferred embodiments, the identified substance can be combined with a carrier suitable for introduction into any living cell or organism, e.g., an animal model, e.g., naturally derived or synthetic polymers, solvents, dispersion media, coatings, antibacterial and antifungal agents and the like.

Methods referred to herein can further include providing a three dimensional representation of the protein structure, or a representation of the primary sequence of the protein structure, either before or after a modification. The structure can be compared to the candidate structure or can be evaluated for the ability to exhibit a predetermined structure, e.g., possession of a structural component such as a helix or a turn segment, or an activity, e.g., the ability to dock with a second protein.

In methods referred to herein the nucleic acid sequence can be any of: a [illegible] sequence; an mRNA sequence; a sequence which encodes a protein structure of known function; a sequence for which the reading frame, if it exists, is known; a sequence for which the reading frame, if it exists, is unknown; a sequence which includes a coding portion; a sequence which includes a non-coding portion; or a sequence from a multiprotein data base.

Methods of the invention allow a wide variety of information to be extracted from nucleic acid sequences and allow a wide variety of useful manipulations, e.g., the identification of useful protein structures and the design of improved or altered function protein structures. These include, but are not limited to:

providing a protein structure encoded by codons of a first subcode which has a predetermined property of a protein structure encoded by codons of a second class or subcode. This allows: provision of a protein structure having a novel amino acid sequence but which has a desired property, e.g., secondary structure, of a known protein; provision of protein structure with improved or altered function;

identifying regions of proteins which are encoded by runs of codons of a single class or subcode, thereby identifying regions which have been resistant to evolutionary or mutational change and which may therefore be functionally important;

identifying a critical amino acid residue(s) in a protein structure by identifying "minority codons" in runs encoded by codons of a single class or subcode, thereby identifying amino acid replacements which, although disfavored at the mRNA level, exhibit sufficiently favored characteristics at the protein level that they have been maintained and may therefore be functionally important;

determination of nearest neighbor relationships based upon nearest neighbors encoded by codons drawn from the same class or subcode;

distinguishing a coding region from a non-coding region by determining whether the region obeys nearest neighbor relationships involving codons drawn from the same class or subcode;

assignment of function (or structure) to a protein or polypeptide of unknown structure by recognizing codon patterns in message-level nucleic acid which encodes the protein or polypeptide structure of unknown function (e.g., the protein or polypeptide is encoded in a first subcode) similar to codon patterns in message-level nucleic acid which encodes the structure of known function (but different primary sequence) (e.g., which is encoded by a second subcode).

DEFINITIONS

As used herein, "protein structure" refers to a structure of at least two amino acids linked by a peptide bond. A protein structure can include an entire protein, or a part thereof. For example, a protein structure can include a domain or other region having a characteristic structural, chemical, or biological property. Examples of structural elements include helices, turns, sheets, helix-turn structure; tertiary amino acid structure; and the like. Examples of chemical properties include net charge, side chain bulk, side chain charge, acidity, nucleophilicity, hydrophobicity, and the like. Examples of biological properties include catalytic activity, promoter or suppressor activity, ability to bind to or interact with a second molecule such as DNA, RNA, a protein, a metal atom, immunological activity, and the like. Examples of known domains which can be included in protein structures include: zinc fingers, binding regions, and the like. A protein structural element can be from a naturally occurring protein or can be a non-naturally occurring (e.g., a novel) construct. The protein structure can be of a predetermined length. In preferred embodiments it is at least 8, 16, 32, 64 or 128 amino acids in length.

As used herein, a predetermined property is a property other than the sequence of amino acids, and can include one or more of the following: (1) three dimensional structure, e.g., secondary structure, tertiary structure, or quaternary structure; (2) a charge-related property, e.g., due to positively or negatively charged side chain residues, including, but not limited to: the presence of a predetermined charge at a predetermined location in the sequence, the net charge on a protein or polypeptide, and the like; (3) hydrophobicity, e.g., due to the presence of water-insoluble side-chain residues; (4) an activity associated with an intramolecular interaction or an intermolecular interaction. Intermolecular interactions include binding activity, catalytic activity, and the like.

An "amino acid alphabet," as used herein, refers to a group of codons which encode amino acids or stop codons.

As used herein, a "binary choice" amino acid alphabet of n degrees of freedom refers to an amino acid alphabet which is structured into 2^n subcodes by the application of binary choices dictated by n choice parameters, where a choice parameter is a function of nucleic acid sequence and/or codon patterns of the nucleic acid (e.g., an mRNA).

A "binary choice parameter" or "opposition," as used herein, refers to a parameter by which a polynucleotide codon or triplet can be assigned one of two values. The assigned values allow the triplets to be assigned to classes. It will be appreciated that application of more than one non-degenerate binary choice parameter can divide triplets into more than two classes. The division into classes can be based on a predetermined value, e.g., all triplets with a value less than the predetermined value are in one class and all with values above the predetermined value are in a second class, or all triplets having a predetermined characteristic, e.g., being pyrimidine-rich, are in a first class and all codons being pyrimidine-poor are in a second class.

The term "coding modality," as used herein, refers to a pattern of codon usage in a nucleic acid message, e.g., the frequency that one or more codons appears in a nucleic acid sequence, the relative frequency that one or more codons appears in two or more reading frames of a nucleic acid message, and the like.

A "triplet", as used herein, refers to three contiguous (sequential) nucleic acid residues (e.g., read in the 5'-3' direction along the nucleic acid strand). A triplet can be a codon (e.g., when a coding nucleic acid sequence is read in the coding frame) or can be a non-reading frame triplet or non-coding triplet.

A leading triplet, as used herein, refers to a triplet which is 5' to the most 3' base in the subject triplet. Thus, in a sequence 12345, the leading triplet is 123.

A final triplet, as used herein, refers to a triplet which is 3' to the most 5' base in a subject triplet. Thus, in a sequence 12345, the final triplet is 345.

A class of triplets, as used herein, refers to all triplets which fall within a particular subgroup of triplets under a selected binary choice alphabet.
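
The leading and final triplet definitions can be made concrete with a small Python sketch. It takes the subject triplet in the 12345 example to be the middle one, 234, which the example implies but does not state outright; the function name and the 0-based `start` index are conventions of this sketch.

```python
def flanking_triplets(seq, start):
    """Return (leading, subject, final) triplets, where `start` is the
    0-based index of the first base of the subject triplet."""
    if start < 1 or start + 4 > len(seq):
        raise ValueError("subject triplet needs one base on each side")
    leading = seq[start - 1:start + 2]  # shifted one base toward the 5' end
    subject = seq[start:start + 3]
    final = seq[start + 1:start + 4]    # shifted one base toward the 3' end
    return leading, subject, final

triplets = flanking_triplets("12345", 1)
```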

A message-level property, as used herein, refers to a property of a nucleic acid (e.g., mRNA) of three or more bases in length, which property is other than the identity of or physical or chemical property of an amino acid (or punctuation) encoded by the nucleic acid (wherein such physical and chemical characteristics include, e.g., size, hydrophobicity, hydrophilicity), and is other than codon bias. Structural message-level properties include physical and energetic properties of the nucleic acid. Examples include: UA-rich triplets vs. CG-rich triplets; UG-rich triplets vs. AC-rich triplets; purine-rich ("R-rich") triplets vs. pyrimidine-rich ("Y-rich") triplets; assigning a plurality of codons in said sequence to (1) either a Y-rich subcode or an R-rich subcode and (2) to either an E-rich (UG-rich) subcode or an M-rich (AC-rich) subcode. Compositional message-level properties include frequencies of particular codon groups in one or more reading frames of a message.
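
To illustrate how two binary choice parameters yield 2^2 = 4 classes, the following Python sketch applies the R-rich/Y-rich and E-rich/M-rich oppositions named above to each triplet of a message. Reading "rich" as "at least two of the three bases", the two-letter class labels, and the function names are assumptions of the sketch and are not defined in the specification.

```python
from collections import Counter

PURINES = frozenset("AG")  # R; the pyrimidines C and U count as Y
KETO = frozenset("UG")     # E in the text; A and C count as M

def assort_triplets(sequence):
    """Split a sequence into consecutive triplets, dropping any remainder."""
    seq = sequence.upper().replace("T", "U")
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]

def classify_triplet(triplet):
    """Apply two binary choice parameters to give one of four classes."""
    r_rich = sum(b in PURINES for b in triplet) >= 2
    e_rich = sum(b in KETO for b in triplet) >= 2
    return ("R" if r_rich else "Y") + ("E" if e_rich else "M")

def class_counts(sequence):
    return Counter(classify_triplet(t) for t in assort_triplets(sequence))

counts = class_counts("AUGGCCUUUAAA")
```

Because a triplet has an odd number of bases, each opposition always resolves to one side or the other under the majority reading, so every triplet lands in exactly one of the four classes.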

The term "reading frame" is known in the art and refers to a frame for reading, e.g., translating, a nucleic acid message. For example, a sequence of nucleotides 123456789 can be read in three reading frames (e.g., in groups of three nucleotides, each triplet being a codon): Reading Frame 1: 123 456 789; Reading Frame 2: 234 567; or Reading Frame 3: 345 678.
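
The three reading frames of the example can be reproduced with a few lines of Python. The function name and the dictionary keyed by frame number are choices made for this sketch.

```python
def reading_frames(sequence):
    """Map frame number (1, 2, 3) to the complete triplets read in that frame."""
    return {
        frame: [sequence[i:i + 3]
                for i in range(frame - 1, len(sequence) - 2, 3)]
        for frame in (1, 2, 3)
    }

frames = reading_frames("123456789")
```

Incomplete trailing groups are dropped, which is why frames 2 and 3 each yield two triplets in the example rather than three.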
a protein or polypeptide. For example, evaluating protein structure includes: determining the three-dimensional structure of a protein or polypeptide; comparing the three-dimensional structure of a known protein or polypeptide with that of an unknown protein or polypeptide; determining the function of a protein or polypeptide; comparing the function of a known protein or polypeptide with that of an unknown protein or polypeptide; and the like.

Other features and advantages of the invention will be apparent from the following detailed description, and from the claims.

Brief Description of the Drawings 

Figure 1 schematically depicts alternate reading frames for a nucleic acid message.

Figure 2 depicts the distinction between "wildcard" and "constant" codon doublets.

Figure 3 shows the 64 codons divided into four groups based on the "wildcard" and "constant" distinction and the leading base of the codon.

Figure 4 shows the frequencies of codons in the groups of Figure 1 in a test mRNA database.



Detailed Description 

In general, the invention features a method of evaluating protein structure. The method includes:

providing a nucleic acid sequence which encodes the protein structure;
assorting bases of the nucleic acid sequence into subject triplets; and
assigning one or a plurality of subject triplets to one of a plurality of classes, wherein the assignment is a function of classifying triplets, e.g., a subject triplet [illegible] a leading and following triplet of the subject triplet, of the nucleic acid sequence as members of a class of a binary choice alphabet of n degrees of freedom, and wherein the classes can be generated by applying n binary choice parameters to a triplet to yield at least 2^n classes of subject triplets, wherein a binary choice parameter is a function of a message-level property of the nucleic acid sequence,

thereby evaluating the protein structure. Triplets can be assigned to a class based on whether they satisfy a value for the message-level property, e.g., a triplet can be assigned to a class based on whether its value for a parameter is above or below a predefined value, or whether or not it possesses a particular characteristic, e.g., whether it is GC rich.

The message-level property is other than the identity of the amino acid or punctuation which a triplet encodes and is other than codon bias.

In preferred embodiments the method includes making a record, e.g., on a machine readable medium, of the class assigned to one or more triplets.

In preferred embodiments: n is chosen from the integers 1, 2, 3, and 4.

In preferred embodiments the message-level property is a function of a physical or chemical property of one or more bases of a nucleic acid; is a function of a physical or chemical property which affects the tendency of a nucleic acid to form secondary structure.

In preferred embodiments triplets are assigned to a first and a second class:
the first class having the property that a message made of triplets drawn exclusively from the first class is less likely to form secondary (intrachain) structure than is a message which is made of triplets from both the first class and the second class of triplets, and
the second class having the property that a message made of triplets drawn exclusively from the second class is less likely to form secondary (intrachain) structure than is a message which is made of triplets from both the first class and the second class of triplets.

In preferred embodiments the message-level property is: a function of the UA content of a subject triplet; a function of the GC content of a subject triplet; a function of the size or molecular weight of a triplet; a function of whether the triplet is keto rich or amino rich; a function of whether the triplet is purine rich or pyrimidine rich; a function of the enthalpy of the interaction between the triplet and a fully or partially complementary nucleic acid.

In preferred embodiments: the binary choice parameter is applied to the subject triplet, e.g., applied to the codon which encodes an amino acid, to place a subject triplet in a class.

In preferred embodiments: the class into which a subject triplet is assigned is a function of:

(1) providing a value for a subject triplet of bases 456, wherein the value is a function of the application of a binary choice parameter to a first set of contiguous bases which includes all or a subset of the bases of the subject triplet, e.g., bases 4 and 5, and of the application of a binary choice parameter to a second, different, set of contiguous bases which includes all or a subset of the bases of the subject triplet, e.g., bases 5 and 6; and

(2) assigning a plurality of subject triplets to a first class, and a plurality of triplets to a second class, as a function of subject triplet value.

In preferred embodiments: the class into which a subject triplet is assigned is a
function of:

(1) providing a value for a subject triplet of bases 456, wherein the value is a
function of the application of a binary choice parameter to a first subset of the bases of
the subject triplet, e.g., 4 and 5, and of the application of a binary choice parameter to a
second, different, subset of the bases of the subject triplet; and

(2) assigning a plurality of subject triplets to a first class, and a plurality of
triplets to a second class, as a function of subject triplet value.

In preferred embodiments: the class into which a subject triplet is assigned is a
function of:

(1) providing a value for a subject triplet of bases 456, wherein the value is a
function of (S1 + S2)/2, wherein S1 is a function of the application of a binary choice
parameter (e.g., the value for enthalpy of anticodon-codon formation above or below a
predetermined value) to a first subset of the bases of the subject triplet, e.g., bases 4 and
5 of the subject triplet, and S2 is a function of the application of a binary choice
parameter to a second, different, subset of the bases of the subject triplet, e.g., bases 5
and 6 of the subject triplet; and

(2) assigning a plurality of subject triplets to a first class, and a plurality of
triplets to a second class.

In preferred embodiments: the class into which a subject triplet is assigned is a
function of the application of a binary choice parameter to one or both of a leading
triplet or a final triplet of the subject triplet.

In preferred embodiments: the class into which a subject triplet is assigned is a
function of:

(1) providing a value for a subject triplet of bases 456, wherein the value
is a function of (S1 + S2)/2, wherein S1 is the value, e.g., enthalpy, of the base pair
doublet 45 of the subject triplet, and S2 is the value, e.g., enthalpy, of the base pair
doublet 56 of the subject triplet; and

(2) assigning a plurality of subject triplets to a first class, e.g., a low enthalpy
class, and a plurality of triplets to a second class, e.g., a high enthalpy class.
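The doublet-averaging step can be sketched as follows. This is a minimal Python illustration, not the inventors' implementation: the function names, the cutoff argument, and the numbers in toy_values are placeholders invented for the sketch and are not measured enthalpies.

```python
def triplet_value(triplet, doublet_values):
    """Return (S1 + S2)/2, where S1 is the value of doublet 45 and
    S2 is the value of doublet 56 of a subject triplet 456."""
    s1 = doublet_values[triplet[0:2]]
    s2 = doublet_values[triplet[1:3]]
    return (s1 + s2) / 2

def enthalpy_class(triplet, doublet_values, cutoff):
    """Binary choice: 'low' below the predetermined cutoff, else 'high'."""
    return "low" if triplet_value(triplet, doublet_values) < cutoff else "high"

# Placeholder numbers for illustration only; NOT real doublet enthalpies.
toy_values = {"GC": 3.0, "CA": 1.0, "AU": 0.5}
```

With these placeholder values, triplet_value("GCA", toy_values) averages the GC and CA entries; a real table of doublet values and a real cutoff would have to be supplied by the user.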

In preferred embodiments: a subject triplet 456 of a nucleic acid sequence of
bases 123 456 789 is assigned into a class as a function of:
(1) performing one or more of (i), (ii), and (iii):

(i) applying a binary choice parameter to a leading triplet of 456, e.g., to one or
more of triplet 123, 234, or 345, to yield a leading value;

(ii) applying a binary choice parameter to 456, to provide a center value;

(iii) applying a binary choice parameter to a following triplet of 456, e.g., to one
or more of triplet 567, 678, or 789, to yield a following value;

(2) assigning one or a plurality of subject triplets 456 into a class based on the
values determined in one or more of (i), (ii) and (iii),

thereby assigning one or a plurality of subject triplets into classes.
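Steps (i) to (iii) can be sketched in Python as below. The function name context_values is hypothetical, and the parameter argument stands for any binary choice parameter; how the three kinds of value are combined into a class in step (2) is left open here, as it is in the text.

```python
def context_values(nine_bases, parameter):
    """For bases 123456789 with subject triplet 456, apply `parameter` to the
    leading triplets (123, 234, 345), the subject triplet (456), and the
    following triplets (567, 678, 789)."""
    if len(nine_bases) != 9:
        raise ValueError("expected exactly nine bases")
    leading = [parameter(nine_bases[i:i + 3]) for i in (0, 1, 2)]
    center = parameter(nine_bases[3:6])
    following = [parameter(nine_bases[i:i + 3]) for i in (4, 5, 6)]
    return leading, center, following

# Using the identity function shows which positions each value is drawn from.
leading, center, following = context_values("123456789", lambda t: t)
```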

In preferred embodiments: the class into which a subject triplet is assigned is a
function of the application of a first binary choice parameter to a leading triplet and a
second binary choice parameter to a following triplet of a subject triplet.

In preferred embodiments: the evaluation includes determining if the nucleic
acid sequence includes a run of triplets, e.g., a run at least 20, 40, 60, or 120 triplets in
length, in which at least 20, 40, 60, 80, 90 or 95 %, or all, of the triplets in the run are
from a first class. The method allows for evaluating a protein structure for resistance to
change, e.g., evolutionary or mutational change, by identifying regions of the protein
structure which are encoded by a run of a single class or subcode, thereby identifying
regions which have been resistant to change and which are therefore predicted to be
functionally or structurally significant. In preferred embodiments a codon, preferably
within the run, is changed so as to alter the sequence of the encoded amino acid to
provide an altered sequence.
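One way to look for such runs is a fixed-length sliding window over the per-triplet class labels, sketched below in Python. The name find_runs, the half-open (start, end) return format, and the use of a fixed window length are choices made for this sketch; the text does not prescribe a particular search procedure.

```python
def find_runs(classes, target, min_length=20, min_fraction=0.8):
    """Return (start, end) half-open windows of `min_length` triplets in which
    at least `min_fraction` of the triplets belong to class `target`."""
    hits = [c == target for c in classes]
    windows = []
    count = sum(hits[:min_length])
    for start in range(0, len(hits) - min_length + 1):
        if start > 0:
            # Slide the window: add the entering triplet, drop the leaving one.
            count += hits[start + min_length - 1] - hits[start - 1]
        if count >= min_fraction * min_length:
            windows.append((start, start + min_length))
    return windows

example_windows = find_runs(list("AAAABBBBB"), "A", min_length=5, min_fraction=0.8)
```

Overlapping windows are reported separately; merging them into maximal regions would be a natural next step.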

In preferred embodiments: the evaluation comprises identifying a triplet from a
first class in a run of triplets of a second class, e.g., a run at least 20, 40, or 60 codons in
length, in which at least 20, 40, 80, 90 or 95 %, or all, of the codons are from the second
class, thereby identifying the triplet of the first class as encoding a critical residue, e.g., a
structure or function critical residue. In a preferred embodiment, a codon is changed so
as to alter the amino acid encoded by the critical residue, a residue adjacent to the critical
residue, or a residue which interacts with the critical residue, and thereby provide an
altered sequence.

In preferred embodiments: the nucleic acid encodes a protein structure of known
or unknown function.

In another aspect, the invention features a class-constant table of nearest
neighbor relationships for amino acid residues which provides, for each of a plurality of
class constant nearest neighbors, a frequency of occurrence which is a function of the
occurrence of the class constant nearest neighbor pair in a collection of protein
structures, e.g., a collection of at least 10, 50, 100, or 500 proteins.

The class constant table provides a measure of the frequency with which a first
and a second amino acid occur as nearest neighbors and wherein nearest neighbor
frequencies are determined within a codon class, and wherein a class is a function of a
message level property of a nucleic acid, e.g., the codon, which encodes an amino acid.
The class can be any class generated by the binary choice parameter-based methods
referred to herein. For example, if the classes are a first class, e.g., high enthalpy codons,
and a second class, e.g., low enthalpy codons, the table is generated for nearest
neighbors where both neighbors are encoded by codons of either the first class or codons
of the second class.

In preferred embodiments: the assignment of amino acids into a class is done by
assigning a codon which encodes it into a class as a function of classifying triplets, e.g.,
the subject codon or a leading and following triplet of the subject codon, as a member of
a binary choice alphabet of n degrees of freedom by applying n binary choice
parameters to a triplet to yield at least 2^n classes of triplets, wherein a binary choice
parameter is a function of a message-level property of the nucleic acid sequence.

The table can be recorded on a machine readable medium.
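A class-constant table of this kind might be built as in the following Python sketch. Two assumptions are made loudly here: "nearest neighbors" are taken to be residues adjacent in the sequence (the text may equally intend spatial neighbors), and each residue is supplied already paired with the class of the codon that encodes it. The function name class_constant_table is hypothetical.

```python
from collections import Counter, defaultdict

def class_constant_table(proteins):
    """Build per-class frequencies of nearest neighbor pairs.

    `proteins` is an iterable of sequences of (amino_acid, codon_class) pairs.
    Only pairs whose two residues share a codon class are counted
    (assumption: neighbors are sequence-adjacent residues)."""
    counts = defaultdict(Counter)
    for residues in proteins:
        for (aa1, c1), (aa2, c2) in zip(residues, residues[1:]):
            if c1 == c2:
                counts[c1][(aa1, aa2)] += 1
    table = {}
    for cls, pair_counts in counts.items():
        total = sum(pair_counts.values())
        table[cls] = {pair: n / total for pair, n in pair_counts.items()}
    return table

toy_protein = [("M", "high"), ("A", "high"), ("G", "low"), ("A", "low"), ("G", "low")]
toy_table = class_constant_table([toy_protein])
```

The resulting dictionary can be serialized (e.g., as JSON or CSV) to record it on a machine readable medium.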

In another aspect, the invention features a method of evaluating a protein
structure. The method includes: providing a class-constant table of nearest neighbor
relationships for amino acid residues;
providing a nucleic acid which encodes a protein structure; and
comparing one or a plurality of the observed nearest neighbor pairs in the protein
structure with the frequencies provided by the class constant table, thereby evaluating
the protein structure.

The class constant table provides a measure of the frequency with which a first
and a second amino acid occur as nearest neighbors and wherein nearest neighbor
frequencies are determined within a codon class, and wherein a class is a function of a
message level property of a nucleic acid, e.g., the codon, which encodes an amino acid.
The class can be any class generated by the binary choice parameter-based methods
referred to herein. For example, if the classes are a first class, e.g., high enthalpy codons,
and a second class, e.g., low enthalpy codons, the table is generated for nearest
neighbors where both neighbors are encoded by codons of either the first class or codons
of the second class.

In preferred embodiments, the comparison can include: assigning an expected
frequency from the class constant table to one or a plurality of the observed nearest
neighbor pairs and determining how many of the observed nearest neighbor pairs fall
above or below a predetermined value; determining the likelihood of occurrence, as
predicted by the class constant table, for an observed nearest neighbor pair; or
determining if an observed nearest neighbor pair of a first and a second amino acid
residue from the protein structure is predicted by the class constant table to occur at a
predetermined frequency.

In preferred embodiments: the assignment of amino acids into a class is done by
assigning a codon which encodes it into a class as a function of classifying triplets, e.g.,
the subject codon or a leading and following triplet of the subject codon, as a member of
a binary choice alphabet of n degrees of freedom by applying n binary choice
parameters to a triplet to yield at least 2^n classes of triplets, wherein a binary choice
parameter is a function of a message-level property of the nucleic acid sequence.

In preferred embodiments the method includes making a record of observed
class constant nearest neighbors in the protein structure on a machine-readable medium.

In preferred embodiments: the method further includes determining if an
observed nearest neighbor of the protein structure is that predicted, at a predetermined
frequency, by the table, thereby evaluating the protein structure.

In preferred embodiments the method can be used to identify coding regions in a
nucleic acid sequence. A coding region can be identified by comparing observed nearest
neighbors in the protein structure with a class constant nearest neighbor table, the
presence of observed pairs which correspond to predicted pairs in the table being
predictive of a coding region. In a preferred embodiment, a codon in the coding region
is changed so as to alter its encoded amino acid.

In preferred embodiments the method can identify structure or function critical
residues, the occurrence of a nearest neighbor of low probability being predictive of a
critical amino acid residue. In a preferred embodiment, a codon is changed so as to alter
the amino acid encoded by the critical residue, a residue adjacent to the critical residue,
or a residue which interacts with the critical residue.

In preferred embodiments: the protein structure is from a protein of known or
unknown function.

In preferred embodiments: the protein structure is evaluated for the presence of a
first nearest neighbor with a predicted occurrence below a predetermined value which is
located in a run of residues, wherein at least 20, 40, 80, 90 or 95% of the residues in the
run are members of nearest neighbor pairs having an expected frequency from the table
of greater than a predetermined value, thereby identifying a critical residue. In a
preferred embodiment, a codon is changed so as to alter the amino acid encoded by the
critical residue, a residue adjacent to the critical residue, or a residue which interacts
with the critical residue.

In preferred embodiments: the nearest neighbor includes or is adjacent to a
critical residue.

In another aspect, the invention includes a machine-readable medium on which
is recorded a class-constant nearest neighbor table.

In another aspect, the invention features a method of evaluating a protein
structure for resistance to change, e.g., evolutionary or mutational change. The method
includes:

identifying regions of a protein which are encoded by runs of a single subcode,
thereby identifying regions which have been resistant to change and which are therefore
predicted to be functionally or structurally significant. E.g., the method can include
determining if the nucleic acid sequence which encodes the protein structure includes a
run of triplets, e.g., a run at least 20, 40, 60, or 120 triplets in length (or e.g., 16, 32, 48,
64, 128, or 256 triplets in length), in which at least 20, 40, 60, 80, 90 or 95 %, or all, of
the triplets in the run are from one class. Any of the ways of generating classes
described herein can be used in this method.

In another aspect, the invention includes a method of evaluating a protein
structure for the presence of critical amino acid residues. The method includes:

identifying critical amino acid residues by identifying "minority codons" in runs
encoded by codons of a single class or subcode, thereby identifying residues which have
been resistant to change and which are therefore believed to be functionally important.
Any of the ways of generating classes described herein can be used in this method.

In preferred embodiments: the evaluation comprises identifying a triplet from a
first class in a run of triplets of a second class, e.g., a run at least 20, 40, or 60 codons
in length, in which at least 20, 40, 60, 80, 90 or 95 %, or all, of the codons are from the
second class, thereby identifying the triplet of the first class as encoding a critical
residue, e.g., a structure or function critical residue. In a preferred embodiment, a codon
is changed so as to alter the amino acid encoded by the critical residue, a residue
adjacent to the critical residue, or a residue which interacts with the critical residue.
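The search for "minority codons" can be sketched in Python as follows. The function name minority_positions and the fixed-window scan are assumptions of this sketch; positions are zero-based indices into the list of per-triplet class labels.

```python
def minority_positions(classes, majority, min_length=20, min_fraction=0.8):
    """Return sorted indices of triplets that are NOT of class `majority` but lie
    inside at least one window of `min_length` triplets in which at least
    `min_fraction` of the triplets are of class `majority`."""
    flagged = set()
    for start in range(len(classes) - min_length + 1):
        window = classes[start:start + min_length]
        if window.count(majority) >= min_fraction * min_length:
            flagged.update(start + i for i, c in enumerate(window) if c != majority)
    return sorted(flagged)

candidates = minority_positions(list("BBBBABBBB"), "B", min_length=5, min_fraction=0.8)
```

Each flagged index marks a candidate critical residue; whether it is in fact structurally or functionally critical is a prediction to be tested, e.g., by changing the codon as described above.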

In another aspect, the invention features a method for evaluating a protein
structure. The method includes:
providing a nucleic acid sequence which encodes the protein structure;
assorting bases of the nucleic acid sequence into subject triplets; and
assigning at least one of the subject triplets to one of a plurality of classes,
wherein the assignment is a function of classifying the subject triplets of the nucleic acid
sequence under a binary choice alphabet of n degrees of freedom by applying n binary
choice parameters to a triplet to yield at least 2^n classes of subject triplets, wherein the
assignment provides at least four classes of triplets, the at least four classes of triplets
being represented in at least a portion of the nucleic acid sequence in a ratio of about
3:5:3:5;

thereby evaluating the protein structure.

In preferred embodiments n is 1, 2, 3, or 4.

In preferred embodiments the method includes making a record, e.g., on a
machine-readable medium, of the class assigned to one or more triplets.

In preferred embodiments, the classes can be generated by application of a binary
choice parameter referred to herein.

In another aspect, the invention features a method for identifying coding regions
of a nucleic acid sequence, the method comprising:
providing the nucleic acid sequence;
assorting bases of at least a portion of the nucleic acid sequence into a plurality
of subject triplets;

assigning the plurality of subject triplets to one of a plurality of classes, wherein
the assignment is a function of classifying the subject triplets of the nucleic acid
sequence under a binary choice alphabet of n degrees of freedom by applying n binary
choice parameters to a triplet to yield at least 2^n classes of subject triplets, wherein the
assignment provides at least four classes of triplets A, B, C, and D;

determining whether the plurality of subject triplets are distributed into the at
least four classes of triplets A:B:C:D in a ratio of about 3:5:3:5;

thereby identifying coding regions of the nucleic acid sequence.

In preferred embodiments n is 1, 2, 3, or 4.

In preferred embodiments (A+B)/(C+D) is about one.

In preferred embodiments (A+D)/(B+C) is about one.
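The three quantities named above can be computed from a list of class labels as in this Python sketch. The function name class_ratios and the scaling of the counts to a total of 16 (so that an exact 3:5:3:5 distribution reads as 3, 5, 3, 5) are conveniences assumed for the illustration.

```python
from collections import Counter

def class_ratios(classes):
    """Return the A:B:C:D counts scaled to sum to 16, plus the two sums-ratios
    (A+B)/(C+D) and (A+D)/(B+C). Assumes C+D and B+C are nonzero."""
    n = Counter(classes)
    a, b, c, d = (n[k] for k in "ABCD")
    total = a + b + c + d
    scaled = tuple(16 * x / total for x in (a, b, c, d))
    return scaled, (a + b) / (c + d), (a + d) / (b + c)

labels = "A" * 3 + "B" * 5 + "C" * 3 + "D" * 5
scaled, ab_over_cd, ad_over_bc = class_ratios(labels)
```

For an exact 3:5:3:5 distribution both sums-ratios equal 8/8, which is why each is expected to be about one.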

In preferred embodiments the method includes making a record, e.g., on a
machine-readable medium, of the class assigned to one or more triplets.

In preferred embodiments, the classes can be generated by application of a binary
choice parameter referred to herein.

In another aspect, the invention features a method for identifying a protein that
includes a polypeptide portion which is structurally or functionally similar to all or a
portion of a test protein, the method comprising:

providing a nucleic acid sequence which encodes all or a portion of the test
protein;
assorting bases of at least a portion of the nucleic acid sequence into a plurality
of subject triplets in a first reading frame;

assigning the plurality of subject triplets in the first reading frame to one of a
plurality of classes, wherein the assignment is a function of classifying the subject
triplets of the nucleic acid sequence under a first binary choice alphabet of n degrees of
freedom by applying n first binary choice parameters to a triplet to yield at least 2^n
classes of subject triplets, wherein the assignment provides at least four classes of
triplets distributed in a ratio of about 3:5:3:5;

assorting bases of the at least a portion of the nucleic acid sequence into a
plurality of subject triplets in a second reading frame;

assigning the plurality of subject triplets in the second reading frame to one of a
plurality of classes, wherein the assignment is a function of classifying the subject
triplets of the nucleic acid sequence under a second binary choice alphabet of n degrees
of freedom by applying n second binary choice parameters to a triplet to yield at least 2^n
classes of subject triplets, wherein the assignment provides at least four classes of
triplets distributed in a ratio of about 3:5:3:5; and

identifying a protein which includes a polypeptide portion encoded by the
plurality of triplets in the second reading frame;

thereby identifying a protein that includes a polypeptide portion which is
structurally or functionally similar to all or a portion of the test protein.

In preferred embodiments, for each of the first and second binary choice alphabets, n
is 1, 2, 3, or 4, e.g., n is two.

In preferred embodiments (A+B)/(C+D) is about one.

In preferred embodiments (A+D)/(B+C) is about one.

In preferred embodiments the first reading frame is frame 1 and the second
reading frame is frame 2 or 3.

In preferred embodiments the method includes making a record, e.g., on a
machine-readable medium, of the class assigned to one or more triplets.

In preferred embodiments, the classes can be generated by application of a binary
choice parameter referred to herein.



In preferred embodiments, identifying a protein which includes a
polypeptide portion encoded by the plurality of triplets in the second reading frame
comprises reading all or a portion of a protein sequence from a database of protein
sequences.

In another aspect, the invention features a method for identifying a mutation-
prone region of a nucleic acid sequence, e.g., a viral nucleic acid sequence. The method
includes:

providing the nucleic acid sequence;
assorting bases of at least a portion of the nucleic acid sequence into a plurality
of subject triplets in a first reading frame;

assigning the plurality of subject triplets in the first reading frame to one of a
plurality of classes, wherein the assignment is a function of classifying the subject
triplets of the nucleic acid sequence under a binary choice alphabet of n degrees of
freedom by applying n binary choice parameters to a triplet to yield at least 2^n classes of
subject triplets, wherein the assignment provides at least four classes of triplets
distributed in a ratio of about 3:5:3:5;

assorting bases of the at least a portion of the nucleic acid sequence into a
plurality of subject triplets in a second reading frame; and
assigning the plurality of subject triplets in the second reading frame to one of a
plurality of classes, wherein the assignment is a function of classifying the subject
triplets of the nucleic acid sequence under a binary choice alphabet of n degrees of
freedom by applying n binary choice parameters to a triplet to yield at least 2^n classes of
subject triplets, wherein the assignment provides at least four classes of triplets
distributed in a ratio of about 3:5:3:5;

thereby identifying a mutation-prone region of the nucleic acid sequence.

In preferred embodiments the method includes making a record, e.g., on a
machine-readable medium, of the class assigned to one or more triplets.

In preferred embodiments, the classes can be generated by application of a binary
choice parameter referred to herein.

Structural Binary Choice Parameters

Structural binary choice parameters can be selected from a variety of physical or
physico-chemical qualities related to the structure of the nucleic acid (polynucleotide)
sequence, including the primary or secondary structure of the nucleic acid sequence, the
physical or chemical nature of the nucleotides, or the physical or chemical nature of
the codons and the anticodons. For example, parameters related to the ability of a nucleic
acid sequence to form secondary structure, e.g., by hybridization of subsequences of the
nucleic acid sequence, can be selected as binary choice parameters. For example, the
self-pairing of a nucleic acid sequence could be greater in, e.g., a highly UA (or GC)-
rich region of the nucleic acid, while a nucleic acid which is not UA (or GC)-rich would
be less prone to self-pairing.

Other exemplary binary choice parameters include the size of the nucleotide
bases (e.g., pyrimidine vs. purine), H-bonding qualities due to H-bond donor or acceptor
substituents (e.g., amino vs. keto-containing nucleotide bases), and the like.

Binary choice parameters can also be related to selected properties of codons,
including the relative enthalpy of codon-anticodon interactions (which can include the
relative enthalpy of the interaction of a codon with its anticodon plus the flanking
complementary bases, e.g., the relative enthalpy of pentamers with their antiparallel
complements); the ability of a codon to be "read" by a tRNA (which can be related to
codon-anticodon interaction enthalpy, size, polarity, and the like), and other such codon-
level parameters. However, a codon-level parameter is not a function of the amino acid
encoded by the codon.

Compositional Binary Choice Parameters

Compositional binary choice parameters can be selected from observed
frequencies of certain codon groups in one message reading frame and/or correlations
among frequencies of particular codon groups in two different reading frames of the
same message. Compositional choice parameters include those derived from enthalpic
and statistical analysis of mRNA pentamers; compositional choice parameters also
include any derived from energetic and statistical analysis of mRNA n-mers (i.e., n > 3),
where such analyses can be shown to yield constant intra- and inter-frame frequencies of
particular codon groups.

Application of Binary Choice Parameters

The application of a first binary choice parameter, e.g., with choices a and b, will
structure the triplets into classes (or subcodes) a and b. The application of a second
choice parameter, with choices c and d, will structure the triplets into classes ac, ad, bc,
and bd. Application of a third binary choice parameter would structure triplets into 2^3 or
eight subcodes. Thus, application of n binary choice parameters to the genetic code will
result in the formation of a binary choice alphabet having 2^n classes or subcodes. It is
possible that some subcodes will be empty when the binary choice alphabet is applied to
a given nucleic acid sequence.
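The way n binary choices combine into up to 2^n subcodes can be sketched in Python as follows. The two parameters used (purine-rich versus pyrimidine-rich, and keto-rich versus amino-rich, each decided by a two-of-three majority) and the one-letter labels r, y, k, and a are assumptions chosen for this sketch.

```python
from itertools import product

def purine_rich(triplet):
    """True when at least two bases are purines (A or G)."""
    return sum(base in "AG" for base in triplet) >= 2

def keto_rich(triplet):
    """True when at least two bases are keto bases (G or U)."""
    return sum(base in "GU" for base in triplet) >= 2

# Each entry: (binary choice parameter, label if True, label if False).
PARAMETERS = [(purine_rich, "r", "y"), (keto_rich, "k", "a")]

def subcode(triplet, parameters=PARAMETERS):
    """Concatenate one label per binary choice, giving one of up to 2^n subcodes."""
    t = triplet.upper().replace("T", "U")
    return "".join(yes if test(t) else no for test, yes, no in parameters)

all_subcodes = {subcode("".join(b)) for b in product("UCAG", repeat=3)}
```

With these two parameters all four subcodes are populated over the 64 triplets, although a given nucleic acid sequence need not use every one of them.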

The binary choice parameter can be applied directly to a subject triplet to assign
triplets into a class. For example, the binary choice parameter can be based upon
relative enthalpy of a codon-anticodon interaction (e.g., the codons are divided into
group(s) of codons having high relative enthalpy and group(s) of codons having low
relative enthalpy) and that parameter applied to a subject codon such as 234. A subject
triplet can also be assigned a class by a method which relies on bases which are not in
the subject triplet, or which do not correspond exactly to the bases of the subject triplet.
E.g., the binary choice parameter can be applied to one or more base pairs which do not
define the triplet. E.g., in evaluation of triplet 234, the binary choice parameter can be
applied to triplet 123 and triplet 345, and the classes into which the triplets 123 and 345
fall can be used to assign a class or subcode to the triplet 234. In other words, the
subcode of 234 can be a function of the application of the binary choice parameter to the
triplets 123 and 345.



Frame Choice

Methods of the invention require the division of a sequence of bases into triplets.
The simplest way is to consider a string of bases, 123456789, as triplets of 123 456 789.
Mechanistically, this or any mode of division into triplets can be viewed as a process
with two components, a "ratchet" or advance component and a "read" or selection
component. As will be seen below, the ratchet component varies by the number of base
pairs advanced after the determination of a triplet.

The read component refers to the length, in base pairs, of the segment of base
pairs from which the triplet will be chosen.

The simplest system, that used by most evolutionarily current cellular
mechanisms, is "ratchet 3/read 3" (that is, the mRNA is advanced, or ratcheted,
through the reading mechanism three bases at a time, and the message is read by the
reading mechanism in groups of three bases (one codon)). Other systems, however, are
possible. Without being bound by theory, it is postulated that other systems may have
existed in earlier stages in the evolution of the cellular protein translation machinery. In
fact, examples of current frame-shift repressing tRNAs are known. Thus, possible
alternate systems include "ratchet 3/read 5 on center" (in which the mRNA is ratcheted
into the reading mechanism three bases at a time, and the reading mechanism reads the
group of three bases at the center of a group of five bases in the reading mechanism). If
a read value is more than 3 (e.g., in a "ratchet 3/read 5 on center" system), then
additional choices are imposed: the triplet must be selected from the 3+K bases which
are read. Thus, a string 1 2 3 4 5 6 7 8 9 10 can be divided into the following triplets:
234 | 567 | 8 9 10, which would be generated by reads 12345 | 45678 | 7 8 9 10 11, wherein
the on center bases (234, 567, and 8 9 10) are chosen.

For example, read-ratchet mechanisms or configurations can be divided into the
following classes:

Class 1 : ratchet 3; read 3
Class 2 : ratchet 3; read 5 and select the center triplet, 234 of 12345
Class 2a: ratchet 3; read 5 and select the leading triplet, 123 of 12345
Class 2b: ratchet 3; read 5 and select the final triplet, 345 of 12345
Class 2c: ratchet 3; read 5 and read any triplet
Class 3 : ratchet 3; frameshift; read 5, any triplet
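The ratchet and read components can be sketched in Python as below. The function name select_triplets and the select keyword ("center", "leading", "final") are hypothetical; the sketch covers Class 1 and Classes 2, 2a, and 2b, while Classes 2c and 3 (any triplet, frameshift) are not modelled.

```python
def select_triplets(bases, ratchet=3, read=5, select="center"):
    """Advance `ratchet` positions at a time, read a window of `read` positions,
    and pick the leading, center, or final triplet of each window."""
    offsets = {"leading": 0, "center": (read - 3) // 2, "final": read - 3}
    offset = offsets[select]
    triplets = []
    for start in range(0, len(bases) - read + 1, ratchet):
        window = bases[start:start + read]
        triplets.append(window[offset:offset + 3])
    return triplets

positions = list(range(1, 12))  # base positions 1..11
center_triplets = select_triplets(positions)
```

Positions 1 to 11 are used here because the third five-base read of the "ratchet 3/read 5 on center" example extends to position 11; setting read=3 reproduces the ordinary "ratchet 3/read 3" division.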

Class 2 approaches allow the assignment of a binary choice parameter to a codon
234 as a function of the binary choice parameter outcome for one or both of 123 and
345, e.g., 123 is classified as UG (k) or AC (a) rich; and 345 is classified as UG (k) or
AC (a) rich, which gives the following possible classes for 234: kk, ka, aa, ak. If, for
example, 123 is k, and 345 is a, then 234 is ka. Note that although only one binary
choice is applied, there are 4 degrees of freedom with regard to 234, because the binary
choice parameter is applied twice.
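The Class 2 assignment just described can be sketched in Python as follows. The two-of-three majority rule used to call a triplet UG (k) rich or AC (a) rich is an assumption of this sketch, as are the function names.

```python
def keto_or_amino(triplet):
    """Return 'k' if at least two bases are U or G (keto), else 'a' (amino: A or C)."""
    t = triplet.upper().replace("T", "U")
    return "k" if sum(base in "GU" for base in t) >= 2 else "a"

def flanking_class(five_bases):
    """Classify subject triplet 234 of a read 12345 from the outcomes for the
    leading triplet 123 and the final triplet 345."""
    if len(five_bases) != 5:
        raise ValueError("expected a five-base read")
    return keto_or_amino(five_bases[0:3]) + keto_or_amino(five_bases[2:5])
```

For the read GUGCA, triplet 123 (GUG) is k and triplet 345 (GCA) is a, so the subject triplet 234 (UGC) is placed in class ka, even though the binary choice was never applied to UGC itself.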

A binary choice parameter which divides triplets into classes on the basis of
enthalpy, e.g., of the codon-anticodon interaction (e.g., into enthalpically strong and
enthalpically weak classes), is particularly useful.

Read-ratchet configurations wherein the read value is greater than 3 make
possible the assignment of triplets into classes by binary choice parameters wherein a
subject triplet, e.g., 234, is assigned a value which is a function of the binary choice
parameter outcome of one, and more preferably both, of 123 and 345.

Binary Choice Alphabets

A binary choice alphabet can be constructed by selection of suitable pre-selected
binary choice parameters. For example, binary choice parameters corresponding to
enthalpy (of codon-anticodon interaction), size, polarity, charge, hydrophobicity, etc. can
be selected and combined in any desired combination to arrive at a binary choice
alphabet. Alternatively, a binary choice alphabet can be constructed by segregation of
codons into 2^n classes without selecting the groups based on binary choice parameters.
For example, it is possible to simply segregate codons into randomly-selected classes to
create a binary choice alphabet.

However the binary choice alphabet is constructed, it is generally preferable to
validate the alphabet to ensure that the alphabet will be predictive of protein structure. It
has been found that, in preferred embodiments, valid binary choice alphabets having 2^n
classes will generally include at least four classes (A, B, C, D) for which the following
relationships are true when triplets of a nucleic acid sequence are parsed with the binary
choice alphabet: the ratio A:B:C:D is about 3:5:3:5 (e.g., from about 2:4:2:4 to about
7:11:7:11); the ratio (A+B)/(C+D) is about 1 (e.g., from about 0.9 to about 1.1); and the
ratio (A+D)/(B+C) is about 1 (e.g., from about 0.9 to about 1.1). Thus, in preferred
embodiments, application of a binary choice alphabet to a nucleic acid sequence which
encodes a protein will yield at least four classes in which triplets are arrayed according
to these ratios. It is therefore possible to validate a binary choice alphabet by searching
for the appearance of the desired ratios. If the ratios are found, then the alphabet may
have predictive value for protein structure evaluation. If the ratios are not found, the
alphabet may not have such predictive value. It will be appreciated that the presence or
absence of the ratios provides a useful "check" for a selected binary choice alphabet. In
preferred embodiments, regions of triplets are checked for lengths which are multiples of
16 (e.g., 3+5+3+5=16), such as 16 triplets, 32 triplets, 64 triplets, and the like. Thus, in
a preferred embodiment, groups of N x 16 sequential triplets (wherein N is an integer,
e.g., between 1 and 16) are evaluated to determine whether the desired ratios are present.
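The validation check over groups of N x 16 triplets can be sketched in Python as follows. One interpretation is assumed and should be read as such: the range "from about 2:4:2:4 to about 7:11:7:11" is taken to mean that A/B and C/D each lie between 2/4 and 7/11. The function names and the non-overlapping block scan are likewise choices made for this sketch.

```python
from collections import Counter

def passes_ratio_check(window):
    """True when a window of class labels satisfies the three ratio criteria
    (under the interpretation of the 3:5:3:5 range assumed in this sketch)."""
    n = Counter(window)
    a, b, c, d = n["A"], n["B"], n["C"], n["D"]
    if b == 0 or d == 0:
        return False
    minor_to_major_ok = all(2 / 4 <= x <= 7 / 11 for x in (a / b, c / d))
    balance_ok = all(0.9 <= x <= 1.1 for x in ((a + b) / (c + d), (a + d) / (b + c)))
    return minor_to_major_ok and balance_ok

def scan_blocks(classes, n_blocks=1):
    """Evaluate consecutive, non-overlapping groups of n_blocks x 16 triplets."""
    size = 16 * n_blocks
    return [(start, passes_ratio_check(classes[start:start + size]))
            for start in range(0, len(classes) - size + 1, size)]

good = "AAABBBBBCCCDDDDD"   # exactly 3:5:3:5
bad = "A" * 16
results = scan_blocks(good + bad)
```

A block that passes suggests, but does not establish, that the alphabet has predictive value for that region.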

Another means for validating a binary choice alphabet is by comparing the
frequency of codon groups when the message is read in one reading frame (e.g., Frame
1) with the frequency of the same codon groups when the message is read in another
reading frame (e.g., Frame 2). It has also been found that valid binary choice alphabets
will generally include at least four classes A, B, C, D such that the frequency of codons
A, B, C, D will vary systematically from one frame to another (e.g., from Frame 2 to
Frame 1, or from Frame 1 to another frame). It is therefore possible to further
validate a binary choice alphabet by searching for systematic interframe variation in
the frequencies of codon groups defined by the alphabet. If such systematic variation is
found, the alphabet may have predictive value. One of ordinary skill in the art, in light
of the teachings herein, will be able to select useful binary choice alphabets according to
these criteria using no more than routine experimentation.
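
As a rough illustration of the interframe comparison, the sketch below tabulates class frequencies for one sequence read at two different offsets and reports the per-class difference. The function names and the `classify` callback are hypothetical; any triplet classifier derived from a binary choice alphabet could be supplied.

```python
from collections import Counter


def class_frequencies(seq, offset, classify):
    """Fraction of triplets in each class when seq is read from `offset`."""
    triplets = [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
    counts = Counter(classify(t) for t in triplets)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()} if total else {}


def interframe_shift(seq, classify, offset_a=0, offset_b=1):
    """Per-class frequency change going from offset_a to offset_b."""
    fa = class_frequencies(seq, offset_a, classify)
    fb = class_frequencies(seq, offset_b, classify)
    classes = sorted(set(fa) | set(fb))
    return {cls: fb.get(cls, 0.0) - fa.get(cls, 0.0) for cls in classes}


def leading_base_class(triplet):
    """Toy classifier (an assumption): purine (R) vs pyrimidine (Y) first base."""
    return "R" if triplet[0] in "AG" else "Y"
```

A systematic, reproducible pattern in the values returned by `interframe_shift` across many messages would correspond to the interframe variation described above; a single sequence is not sufficient to establish it.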



WO 98/18814                                                  PCT/US97/19673

Examples

Example 1: Generation of a predictive amino acid alphabet based on binary choices
which are a function of enthalpy of codon-anticodon interaction

This example provides a predictive four-letter amino acid alphabet for the
representation of protein primary structures from the energetic properties of mRNA
molecules, i.e., the translation of mRNAs. While not wishing to be bound by theory, the
basis for deriving an amino acid alphabet from codon-anticodon interactions can be
rationalized as follows: if the genetic code was not "frozen" prior to the onset of
translation and the evolution of protein primary structure, then the evolutionary
trajectory of this code may have been one factor which determined important properties
of protein primary structure. Energetics of codon-anticodon interactions may have been
relevant to the evolution of the genetic code before ribosomes existed, when these
interactions occurred in an aqueous medium.

The configuration of the reading frame may also provide a basis for deriving an
amino acid alphabet. Figure 1 schematically depicts two alternative reading frames for a
nucleic acid sequence, each reading frame defining an energetic packet or triplet; each
nucleic acid base of the message is represented by a black square. Again, while not
being bound by theory, evolution may have favored systems which would allow slippage
from frame 1 to frame 2. This would impose entropic requirements on the code. It is
noted that in this system, which permits "slippage", energy packaging may be analogous
to human linguistic systems which permit slippage and routines for assigning syllabic
stress and consequent systematic recasting of signifying sound tokens. For
comparison, refer to Grimm's Law and Verner's Law for Indo-European.

It is shown herein that the energies of codon-anticodon interactions pattern
systematically, that this pattern implicitly defines a particular amino acid alphabet, and
that this amino acid alphabet characterizes protein primary structure predictively, i.e.,
provides insight into protein secondary and tertiary structure.

Table IA shows the Ornstein-Fresco ΔH values for the 10 possible base pair
overlaps. Using those ΔH values, average ΔH values were calculated for all 64 possible
codon-anticodon interactions for all possible mRNA pentamers (or "five-envelopes")
with codons as the center three bases. The average ΔH values shown in Tables IB and
IC assume that there is no wobble pairing of codon and anticodon. The average ΔH
values in Tables IB and IC were calculated according to the following formula: for any
pentamer ABCDE, ΔH is calculated for B, C, D according to the formula
(ΔH(AB) + ΔH(BC) + ΔH(CD) + ΔH(DE)) / 4. The 64 codon triplets are shown in the first
column of Table IB. Values for a codon in each of all possible five-envelopes are shown
in each row. For example, in the case of UUU, the enthalpy value for a UUU codon
preceded by a U and followed by a U is 2.80. The enthalpic value when UUU is preceded
by a U and is followed by a C is 2.45. The average value for a codon in all possible
"five-envelopes" is given in the penultimate column on the right side of the table. For the
UUU codon, the average for all possible five-envelopes is 2.43. That average is calculated
for all codons in Table IB. The final column (far right) of Table IB provides the average
enthalpic value for all codons having a common leading doublet. For example, all
codons which begin with the doublet UU have an average enthalpic value of 2.11.
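
The pentamer formula can be checked against the two worked UUU values with a short Python sketch. The dictionary below is keyed by the message-strand doublet and contains only those Table IA entries that are legible in the source; the entries for AU and UA are deliberately omitted because their labels cannot be read, so envelopes containing those doublets are not handled here. The function name `envelope_dh` is hypothetical.

```python
# Relative ΔH per doublet, keyed by the message-strand doublet.
# AU and UA are omitted: their Table IA labels are illegible in the source.
DOUBLET_DH = {
    "AA": 2.80, "UU": 2.80,
    "GU": 1.91, "AC": 1.91,
    "CU": 1.52, "AG": 1.52,
    "UC": 1.41, "GA": 1.41,
    "UG": 1.16, "CA": 1.16,
    "GC": 0.95,
    "CC": 0.27, "GG": 0.27,
    "CG": 0.00,
}


def envelope_dh(pentamer):
    """Average ΔH of the four overlapping doublets AB, BC, CD, DE."""
    if len(pentamer) != 5:
        raise ValueError("expected a pentamer of five bases")
    doublets = [pentamer[i:i + 2] for i in range(4)]
    return sum(DOUBLET_DH[d] for d in doublets) / 4
```

For the pentamer UUUUU the four doublets are all UU, giving 2.80; for UUUUC the last doublet is UC (1.41), giving (3 x 2.80 + 1.41) / 4, which rounds to 2.45, in agreement with the values quoted for the UUU codon.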

Table IC shows the values from the penultimate column of Table IB. Note that
the values in Table IC hover around four values, 0.6, 1.2, 1.8, and 2.4. It can also be
seen, as indicated in the caption of Table IC, that for any given doublet XX, the average
enthalpic value for the codons XXU and XXA is about 0.6 higher than the average value
for the codons XXC and XXG.

The energetic pattern evident in Table IC manifests itself in mRNAs. Table IIA
shows 16 enthalpically defined codon groups (separated by dashed lines) produced by
ranking the codons according to the interaction ΔH of the leading doublet, that is, the
first two base pairs of the codon, and by the codon interaction enthalpy value from Table
IC. In Table IIA the first column shows all codons. The second column identifies the
first doublet in the three bases of the codon. The third column provides the ΔH of the
first doublet, the fourth column provides the mean codon ΔH over all 16 possible
pentameric envelopes (as set out in Table IB, penultimate column) and the fifth column
provides a letter for a group designation. The horizontal divisions segregate the first
doublets according to the eight energy levels shown in Table IA. Each of the groups
thus formed by horizontal division is further subdivided on the basis of the average value
for the codon over the possible five-envelopes of Table IB and by which of the 4
energy levels identified in Table IC it falls into. Table IIB is analogous except that the
first binary choice applied is the ΔH for the second or final doublet of the codon.

Table IIC shows the frequency of Table IIA, or leading, codon groups and of
Table IIB, or following, codon groups in a test mRNA database.

The leading or L codon groups of Table IIC correspond to frame 1 of the mRNA
and the final or F codon groups in Table IIC correspond to frame 3 of the mRNA. The
middle column of Table IIC shows the difference in frequency between the L groups and
the F groups shown in the first and last columns of Table IIC. It can be seen that the
differences are very small, which may be a consequence of an original evolutionary
pentameric energy packaging scheme.

One possible explanation for this conserved "epiphenomenon" is that the present
day "ratchet 3/read 3" translation system evolved from a "ratchet 3/read 5 on center"
primordial translation system. Present day "frame shift suppressor" tRNAs with
anticodon loops greater than 3 are possibly mutant analogs of ancestor tRNAs which
regularly read pentamers. According to this view, as ratchet 3/read 3 translation systems
evolved from ratchet 3/read 5 ancestral translation systems, mRNAs would have had to
be repackaged in one of two alternative reading frames different from the original
reading frame. For example, an original evolutionary ratchet 3/read 5 on center system
would read pentamer 12345 as 234. This corresponds to present day frame 2. However,
a ratchet 3/read 3 translation system reading from that same pentamer 12345 would read
123, corresponding to present day reading frame 1, or else it would read 345,
corresponding to present day reading frame 3. It is believed that the prevalence of the
"weak" bases U and A at the 5' ends of the anticodon loops of tRNA pentamers would
favor repackaging of codons into present day frame 1 rather than into present day frame
3.
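
The relationship between a pentamer and the three present-day reading frames can be made concrete with a small Python sketch. The function name `pentamer_readings` is hypothetical; the frame numbering follows the 123 / 234 / 345 convention used in the text.

```python
def pentamer_readings(pentamer):
    """Return the triplet read from a pentamer in frames 1, 2 and 3."""
    if len(pentamer) != 5:
        raise ValueError("expected a pentamer of five bases")
    return {
        1: pentamer[0:3],  # leading triplet (positions 123)
        2: pentamer[1:4],  # center triplet (positions 234)
        3: pentamer[2:5],  # final triplet (positions 345)
    }
```

Applied to the string "12345" this returns "123", "234" and "345" for frames 1, 2 and 3 respectively, matching the example in the preceding paragraph.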

If such an evolution from a ratchet 3/read 5 on center to a ratchet 3/read 3
translation system occurred, the resulting frameshift from reading frame 2 to reading
frame 1 would have the potential to cause disastrous changes in protein structure as the
alternate reading frame was read. There are at least two ways in which catastrophic
mutations could be avoided. First, if the pentamer packets of the earliest mRNAs were
read "loosely" by the earliest tRNA anticodon, that is, if early tRNAs could read either
123, 234, or 345 out of each pentamer, then the loose reading would result in
evolutionary pressure to select mRNAs containing packets which would not introduce
harmful amino acids into protein primary structures when the packets were read
differently, e.g., when the packets were read in frame 1 rather than in frame 2. Second,
if the mRNAs were so selected, then, in the long term, a systemic frameshift would not
necessarily introduce harmful amino acids into protein primary structures in numbers
sufficient to damage the structure or function of a protein, and in fact might permit the
introduction of novel amino acid sequences with beneficial effects on protein secondary
and tertiary structures.

This suggests that if a systemic frameshift occurred, some codon distributions
would have remained essentially unchanged ("constant" codons) while other codon
distributions would have changed ("wild card"), which could have a beneficial effect on
protein structure. In this case, the evolutionary distinction between "wild card" and
"constant" codons might classify amino acids in such a way as to enable the construction
of a predictive amino acid alphabet. Accordingly, a binary choice alphabet was created
in which the "constant" vs. "wild card" distinction was one binary choice parameter
(Figure 2).

Table IIIA shows possible enthalpic groups of leading and final triplets in mRNA
pentamers with the 64 codons as centers. An example is shown in Figure 2, in which the
codon UUA is the center triplet. The first column of Figure 2 shows the four possible
leading (L) triplets together with the classification group from Table IIA in the second
column. The fourth column of Figure 2 shows the classification group of the final (F)
triplets shown in the last column of Figure 2.

As shown in Table IIIA, doublets can be classified as "constant codon doublets"
or "wild card codon doublets". A constant codon doublet is a doublet XX of a codon
XXY or XXR (Y and R stand for a pyrimidine base or a purine base, respectively), in
which XX is UU, CC, GG, or AA, for which codon, as shown in Table IIIA, the leading
(NXX) and final (XYN or XRN) triplets of all possible pentamers (N is any base)
belong to the same enthalpic groups of Tables IIA and IIB. For example, for the codon
UUA (boxed line at upper left of Table IIIA), the four possible leading triplets (NUU) all
belong to the groups Z and W. The four possible final triplets (UAN) also all belong to
the groups Z, W, and X. Because U is a pyrimidine (Y) and A is a purine (R), UUA is a
constant codon doublet of class YXR. A "wild card codon doublet", in contrast, shows
an alternation between enthalpic groups of Tables IIA and IIB as the leading and final
triplets are analyzed over all pentamers. For example, for the codon UUU (top line at
upper left of Table IIIA), the four possible leading triplets (NUU) belong to the groups
Z, W and X, as noted above. The four possible final triplets (UUN) belong to groups
that differ from those of the leading triplets. Because U is a pyrimidine (Y), UUU
is a wild card codon doublet of class YXY.

The distinction between constant codon doublets and wild card codon doublets
can be used to construct a four-letter alphabet. As shown in Figure 3, the 64
codons can be divided into four groups: constant YXR doublets, constant RXY
doublets, wild card YXY doublets, and wild card RXR doublets.

As shown in Figure 4, a test mRNA database was analyzed to determine the
frequencies of the four codon groups in the four-letter amino acid alphabet of Figure 3.
The mRNA database was read in both frame 1 and frame 2. As can be seen from Figure
4, shifting from reading in frame 2 to reading in frame 1 results in the interchange of
frequencies of p and s.


Example 2: Determination of Secondary and Tertiary Protein Structural Features
Correlated With Message Segments Evaluated With a Binary Choice Alphabet

A binary choice alphabet of Example 1 (s, p, d, t) was used to evaluate protein
structures as follows:

Test mRNA sequences were analyzed from a database of mRNAs (e.g., from
GenBank). Note that in GenBank, uracil (U) is stored as "T"; this convention will be
used throughout this example. Each sequence was then analyzed in reading frame 2
using the following mapping:
ATT/ATC/GTT/GTC=A
ACT/ACC/GCT/GCC=B
AAT/AAC/GAT/GAC=C
AGT/AGC/GGT/GGC=D
TTA/TTG/CTA/CTG=E
TCA/TCG/CCA/CCG=F
TAA/TAG/CAA/CAG=G
TGA/TGG/CGA/CGG=H
TTT/TTC/CTT/CTC=I
TCT/TCC/CCT/CCC=J
TAT/TAC/CAT/CAC=K
TGT/TGC/CGT/CGC=L
ATA/ATG/GTA/GTG=M
ACA/ACG/GCA/GCG=N
AAA/AAG/GAA/GAG=O
AGA/AGG/GGA/GGG=P

One binary choice parameter was whether the leading base of the triplet was
purine (A or G; groups A-D and M-P) or pyrimidine (T or C; groups E-L). The other
binary choice parameter was the "wild card" vs. "constant" distinction discussed in
Example 1, supra. It should be noted that this parameter also corresponds to a binary
choice between "symmetrical" (YXY and RXR) codons vs. "non-symmetrical" (YXR
and RXY) codons (in which Y and R are pyrimidine and purine as defined above).
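
The 16-group mapping follows a regular pattern (purine or pyrimidine at the first and third positions, with the second base cycling T, C, A, G), so it can be generated rather than typed out. The following Python sketch is offered as an illustration only; the function name `build_group_map` is hypothetical, and the block ordering is inferred from the groups listed in this example.

```python
from itertools import product

PURINES = "AG"
PYRIMIDINES = "TC"


def build_group_map():
    """Map each of the 64 codons (DNA letters) to a group letter A-P."""
    # (first-base class, third-base class) for groups A-D, E-H, I-L, M-P.
    blocks = [
        (PURINES, PYRIMIDINES),      # RXY: groups A-D
        (PYRIMIDINES, PURINES),      # YXR: groups E-H
        (PYRIMIDINES, PYRIMIDINES),  # YXY: groups I-L
        (PURINES, PURINES),          # RXR: groups M-P
    ]
    letters = iter("ABCDEFGHIJKLMNOP")
    mapping = {}
    for firsts, thirds in blocks:
        for second in "TCAG":
            letter = next(letters)
            for first, third in product(firsts, thirds):
                mapping[first + second + third] = letter
    return mapping
```

The resulting dictionary covers all 64 codons with four codons per group; for instance, ATG falls in group M and TTT in group I.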

The mapped string from reading frame 2 was then converted to the binary choice
alphabet (s, p, d, t) according to the following scheme:
ABCDEFGHIJKLMNOP=ssssppppddddtttt. The result is a binary choice alphabet of
degree 2, dividing the genetic code into 4 classes (denoted s, p, d, t), as shown in Figure
3.

The mapped string was then analyzed using a sliding window of 16 triplets (16
letters in the spdt alphabet) to determine regions in which the s:p:d:t ratio was about
3:5:3:5 in reading frame 2 (that is, s ≥ 2, p ≥ 4, d ≥ 2, t ≥ 4). When such a region
was found, the mRNA sequence was translated to an amino acid sequence in frame 1 for
that region of the mRNA (i.e., by reading the message resulting from adding a base at
the beginning and eliminating a base at the end of the message segment). Our protein
database (described infra) was then searched for proteins which included the resulting
Frame 1 amino acid sequence. When a single protein
was found to have two separate and distinct regions with even low homology to the
derived Frame 1 amino acid sequence, the two regions were often found to have similar,
or virtually identical, secondary and tertiary structural features. When two different
proteins were found, which each manifested one or more regions with even low
homology to the derived Frame 1 amino acid sequence, these regions were often found
to have very similar secondary and tertiary structural features.
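
A minimal sketch of the conversion and sliding-window scan is given below. Several points are assumptions made for illustration: the function names are hypothetical; frame 2 is taken to begin one base after frame 1 (offset 1 when the string starts at a frame 1 codon); and the window criterion is read as s ≥ 2, p ≥ 4, d ≥ 2, t ≥ 4 in 16 letters. The s, p, d, t letters are derived directly from the purine/pyrimidine status of the first and third bases, which is equivalent to the A-P grouping followed by the frame 2 scheme.

```python
from collections import Counter

PURINES = set("AG")


def spdt_letter(codon):
    """Frame 2 scheme: s = RXY, p = YXR, d = YXY, t = RXR."""
    first_is_purine = codon[0] in PURINES
    third_is_purine = codon[2] in PURINES
    if first_is_purine and not third_is_purine:
        return "s"
    if not first_is_purine and third_is_purine:
        return "p"
    if not first_is_purine and not third_is_purine:
        return "d"
    return "t"


def spdt_string(seq, offset=1):
    """Translate a sequence into spdt letters, reading triplets from offset."""
    seq = seq.upper().replace("U", "T")
    return "".join(spdt_letter(seq[i:i + 3])
                   for i in range(offset, len(seq) - 2, 3))


def find_windows(spdt, width=16,
                 minimum=(("s", 2), ("p", 4), ("d", 2), ("t", 4))):
    """Start indices of windows whose letter counts meet every minimum."""
    hits = []
    for start in range(len(spdt) - width + 1):
        counts = Counter(spdt[start:start + width])
        if all(counts[letter] >= n for letter, n in minimum):
            hits.append(start)
    return hits
```

Each index returned by `find_windows` identifies a 16-triplet region that could then be translated in frame 1 and used as a query against a protein database, as described above; that search step is outside the scope of this sketch.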

Example 3: Starting from a Known Protein Structure

Binary choice alphabets (s, p, d, t) were used to evaluate protein structures as
follows:

Test mRNA sequences were read from a database of mRNAs (e.g., from
GenBank). Each sequence was then read in reading frame 1 and in reading frame 2
using the mapping described in Example 2 for the 16-letter alphabet A-P.

The mapped string from reading frame 1 was then converted to a binary choice
alphabet (s, p, d, t) according to the following scheme:
ABCDEFGHIJKLMNOP=ppppssssttttdddd

The mapped string from reading frame 2 was then converted to a binary choice
alphabet (s, p, d, t) according to the following scheme:
ABCDEFGHIJKLMNOP=ssssppppddddtttt

The mapped strings were then analyzed using a sliding window of 16 triplets
(16 letters in the spdt alphabet) to determine regions in which the s:p:d:t ratio was about
3:5:3:5 in both frame 1 and frame 2. When such a region was found, the mRNA
sequence was translated to an amino acid sequence in both frame 1 and frame 2 for that
region of the mRNA. Our protein database was then searched for proteins which contain
the amino acid sequence encoded by the translated region of Frame 2. The database of
protein messages contained messages for three hundred proteins, those proteins being
sixty to six thousand amino acids in length. The proteins included proteins with roles in
protein synthesis, nucleic acid synthesis, protein or nucleic acid degradation, various
"house-keeping" enzymes, and some immunoglobulins. When a protein containing the
sequence was found, the structural similarity (e.g., the tertiary structure) of that portion
of the protein was compared to the structure of the protein encoded by the test mRNA
sequence.
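
The two-frame screen of this example can be sketched as follows. The code is illustrative only and rests on several assumptions: the names are hypothetical; the frame 1 scheme is taken as ppppssssttttdddd and the frame 2 scheme as ssssppppddddtttt over groups A-P; frame 1 is read from offset 0 and frame 2 from offset 1; windows are compared by triplet index; and the same minimum counts (s ≥ 2, p ≥ 4, d ≥ 2, t ≥ 4) are applied in both frames.

```python
from collections import Counter

# Keys: (first base is a purine, third base is a purine).
# Groups A-D are RXY, E-H are YXR, I-L are YXY, M-P are RXR.
FRAME1_SCHEME = {(True, False): "p", (False, True): "s",
                 (False, False): "t", (True, True): "d"}
FRAME2_SCHEME = {(True, False): "s", (False, True): "p",
                 (False, False): "d", (True, True): "t"}
MINIMUM = {"s": 2, "p": 4, "d": 2, "t": 4}


def to_spdt(seq, offset, scheme):
    """Translate a sequence into spdt letters under the given scheme."""
    seq = seq.upper().replace("U", "T")
    return "".join(scheme[(seq[i] in "AG", seq[i + 2] in "AG")]
                   for i in range(offset, len(seq) - 2, 3))


def window_ok(window):
    """True if the window meets the minimum count for every letter."""
    counts = Counter(window)
    return all(counts[letter] >= n for letter, n in MINIMUM.items())


def shared_windows(seq, width=16):
    """Triplet indices at which both frames satisfy the window test."""
    frame1 = to_spdt(seq, 0, FRAME1_SCHEME)
    frame2 = to_spdt(seq, 1, FRAME2_SCHEME)
    last = min(len(frame1), len(frame2)) - width + 1
    return [k for k in range(max(last, 0))
            if window_ok(frame1[k:k + width])
            and window_ok(frame2[k:k + width])]
```

Regions reported by `shared_windows` would be candidates for translation in both frames and for the database search described above.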

It was found that for several test mRNA sequences, many of those portions of the
identified proteins were structurally very similar to the comparable portions of the
protein encoded by the test mRNA sequence. For example, a helix-strand transition in
the protein encoded by the test mRNA sequence was structurally similar to a helix-strand
transition of a protein located in the protein database according to methods of the
invention. Application of the methods of the invention (e.g., the methods of Example 2
and Example 3) to a variety of sequences identified structural similarity in at least
one protein of our protein database for other structural motifs such as sheets, helix
caps, helix exits, Pro-His-Pro turns, and the like.

Example 4 

The function of introns (e.g., non-coding DNA sequences in genomic DNA) is
generally not well understood. Methods of the invention provide knowledge which is
useful for investigating intron function. The methods of the invention can include
searching nucleic acid databases (e.g., of genomic DNA) for regions of nucleic acid
which do not code for protein in the present-day reading frame (i.e., frame 1), but which
could code for protein in an alternate reading frame (e.g., frame 2 or frame 3). Such a
presently non-coding region (i.e., an intron) could correspond to a region of a nucleic
acid which was a coding region prior to a frameshift. Such formerly-coding regions
could encode alternate structures (i.e., protein regions which differ from the modern
protein regions) which preserve the function of the protein.

Thus, a nucleic acid which represents both coding and non-coding regions can be
analyzed in both frames 1 and 2, as described supra for Examples 2 and 3. Where a
non-coding region, such as an intron, is found in which the s:p:d:t ratio is about 3:5:3:5 in
frame 2, that region may correspond to a region of the nucleic acid which coded for
protein structure prior to a shift in reading frame.

Equivalents

Those skilled in the art will recognize, or be able to ascertain using no more than
routine experimentation, many equivalents to the specific embodiments of the invention
described herein. Such equivalents are intended to be encompassed by the following
claims. The contents of all references cited herein are hereby incorporated by reference.

What is claimed is:

1. A method of evaluating protein structure comprising:
providing a nucleic acid sequence which encodes the protein structure;
assorting bases of the nucleic acid sequence into subject triplets; and
assigning one or a plurality of subject triplets to one of a plurality of classes,
wherein the assignment is a function of classifying triplets of the nucleic acid sequence
as members of a class of a binary choice alphabet of n degrees of freedom, and wherein
the classes can be generated by applying n binary choice parameters to a triplet to yield
at least 2^n classes of subject triplets, wherein a binary choice parameter is a function of a
message-level property of the nucleic acid sequence,
thereby evaluating the protein structure.



2. The method of claim 1, further comprising making a record, on a
machine-readable medium, of the class assigned to one or more triplets.

3. The method of claim 1, wherein triplets are assigned to a first and a second
class:
the first class having the property that a message made of triplets drawn
exclusively from the first class is less likely to form secondary (intrachain) structure than
is a message which is made of triplets from both the first class and the second class of
triplets, and
the second class having the property that a message made of triplets drawn
exclusively from the second class is less likely to form secondary (intrachain) structure
than is a message which is made of triplets from both the first class and the second class
of triplets.

4. The method of claim 1, wherein the message-level property is: a function of
the AT content of a subject triplet; a function of the GC content of a subject triplet; a
function of the size or molecular weight of a triplet; a function of whether the triplet is
keto rich or amino rich; a function of whether the triplet is purine rich or pyrimidine
rich; or a function of the enthalpy of the interaction between the triplet and a fully or
partially complementary nucleic acid.

5. The method of claim 1, wherein a subject triplet 456 of a nucleic acid
sequence of bases 123 456 789 is assigned into a class as a function of:
(1) performing one or more of (i), (ii), and (iii):
(i) applying a binary choice parameter to a leading triplet of 456, e.g., to one or
more of triplet 123, 234, or 345, to yield a leading value;
(ii) applying a binary choice parameter to 456, to provide a center value;
(iii) applying a binary choice parameter to a following triplet of 456, e.g., to one
or more of triplet 567, 678, or 789, to yield a following value;
(2) assigning one or a plurality of subject triplets 345 into a class based on the
values determined in one or more of (i), (ii), and (iii),
thereby assigning one or a plurality of subject triplets into classes.

6. A class-constant table of nearest neighbor relationships for amino acid
residues which provides, for each of a plurality of class constant nearest neighbors, a
frequency of occurrence which is a function of the occurrence of the class constant
nearest neighbor pair in a collection of at least 10 proteins.

7. A method of evaluating a protein structure comprising:
providing a class-constant table of nearest neighbor relationships for amino acid
residues;
providing a nucleic acid which encodes a protein structure; and
comparing one or a plurality of the observed nearest neighbor pairs in the protein
structure with the frequencies provided by the class constant table, thereby evaluating
the protein structure.

8. The method of claim 7, wherein the comparison can include: assigning an
expected frequency from the class constant table to one or a plurality of the observed
nearest neighbor pairs and determining how many of the observed nearest neighbor pairs
fall above or below a predetermined value; determining the likelihood of occurrence, as
predicted by the class constant table, for an observed nearest neighbor pair; or
determining if an observed nearest neighbor pair of a first and a second amino acid
residue from the protein structure is predicted by the class constant table to occur at a
predetermined frequency.

9. The method of claim 7, further comprising making a record of observed class
constant nearest neighbors in the protein structure on a machine-readable medium.

10. A machine-readable medium on which is recorded a class-constant nearest
neighbor table.



11. A method of evaluating a protein structure for resistance to change, e.g.,
evolutionary or mutational change, comprising:
identifying regions of a protein which are encoded by runs of a single subcode,
thereby identifying regions which have been resistant to change and which are therefore
predicted to be functionally or structurally significant.

12. The method of claim 11, wherein the method includes determining if the
nucleic acid sequence which encodes the protein structure includes a run of triplets at
least 40 triplets in length, in which at least 90% of the triplets in the run are from one
class.

13. A method of evaluating a protein structure for the presence of critical amino
acid residues comprising:
identifying critical amino acids by identifying anomalous codons in runs
of codons of a single class or subcode, thereby identifying residues which have
been resistant to change and which are predicted to be functionally important.

14. The method of claim 13, wherein the evaluation comprises identifying a
triplet from a first class in a run of triplets of a second class at least 40 codons in length,
in which at least 40% of the codons are from the second class, thereby identifying the
triplet of the first class as encoding a critical residue.

15. A method for evaluating a protein structure comprising:
providing a nucleic acid sequence which encodes the protein structure;
assorting bases of the nucleic acid sequence into subject triplets; and
assigning at least one of the subject triplets to one of a plurality of classes,
wherein the assignment is a function of classifying the subject triplets of the nucleic acid
sequence under a binary choice alphabet of n degrees of freedom by applying n binary
choice parameters to a triplet to yield at least 2^n classes of subject triplets, wherein the
assignment provides at least four classes of triplets, the at least four classes of triplets
being represented in at least a portion of the nucleic acid sequence in a ratio of about
3:5:3:5;
thereby evaluating the protein structure.

16. The method of claim 15, wherein the method includes making a record on a
machine-readable medium of the class assigned to one or more triplets.

17. A method for identifying coding regions of a nucleic acid sequence, the
method comprising:
providing the nucleic acid sequence;
assorting bases of at least a portion of the nucleic acid sequence into a plurality
of subject triplets;
assigning the plurality of subject triplets to one of a plurality of classes, wherein
the assignment is a function of classifying the subject triplets of the nucleic acid
sequence under a binary choice alphabet of n degrees of freedom by applying n binary
choice parameters to a triplet to yield at least 2^n classes of subject triplets, wherein the
assignment provides at least four classes of triplets A, B, C, and D;
determining whether the plurality of subject triplets are distributed into the at
least four classes of triplets A:B:C:D in a ratio of about 3:5:3:5;
thereby identifying coding regions of the nucleic acid sequence.

19. The method of claim 17, wherein the method includes making a record on a
machine-readable medium of the class assigned to one or more triplets.

20. A method for identifying a protein that includes a polypeptide portion which
is structurally or functionally similar to all or a portion of a test protein, the method
comprising:
providing a nucleic acid sequence which encodes all or a portion of the test
protein;
assorting bases of at least a portion of the nucleic acid sequence into a plurality
of subject triplets in a first reading frame;
assigning the plurality of subject triplets in the first reading frame to one of a
plurality of classes, wherein the assignment is a function of classifying the subject
triplets of the nucleic acid sequence under a first binary choice alphabet of n degrees of
freedom by applying n first binary choice parameters to a triplet to yield at least 2^n
classes of subject triplets, wherein the assignment provides at least four classes of
triplets distributed in a ratio of about 3:5:3:5;
assorting bases of the at least a portion of the nucleic acid sequence into a
plurality of subject triplets in a second reading frame;
assigning the plurality of subject triplets in the second reading frame to one of a
plurality of classes, wherein the assignment is a function of classifying the subject
triplets of the nucleic acid sequence under a second binary choice alphabet of n degrees
of freedom by applying n second binary choice parameters to a triplet to yield at least 2^n
classes of subject triplets, wherein the assignment provides at least four classes of
triplets distributed in a ratio of about 3:5:3:5; and
identifying a protein which includes a polypeptide portion encoded by the
plurality of triplets in the second reading frame;
thereby identifying a protein that includes a polypeptide portion which is
structurally or functionally similar to all or a portion of the test protein.

21. A method for identifying a mutation-prone region of a viral nucleic acid
sequence comprising:
providing the nucleic acid sequence;
assorting bases of at least a portion of the nucleic acid sequence into a plurality
of subject triplets in a first reading frame;
assigning the plurality of subject triplets in the first reading frame to one of a
plurality of classes, wherein the assignment is a function of classifying the subject
triplets of the nucleic acid sequence under a binary choice alphabet of n degrees of
freedom by applying n binary choice parameters to a triplet to yield at least 2^n classes of
subject triplets, wherein the assignment provides at least four classes of triplets
distributed in a ratio of about 3:5:3:5;
assorting bases of the at least a portion of the nucleic acid sequence into a
plurality of subject triplets in a second reading frame; and
assigning the plurality of subject triplets in the second reading frame to one of a
plurality of classes, wherein the assignment is a function of classifying the subject
triplets of the nucleic acid sequence under a binary choice alphabet of n degrees of
freedom by applying n binary choice parameters to a triplet to yield at least 2^n classes of
subject triplets, wherein the assignment provides at least four classes of triplets
distributed in a ratio of about 3:5:3:5;
thereby identifying a mutation-prone region of the nucleic acid sequence.

Table IA: RNA Doublet Relative ΔH Enthalpies
(from Ornstein-Fresco)

au/ua or ua/au (label illegible) 2.86
aa/uu 2.80
uu/aa 2.80

au/ua or ua/au (label illegible) 2.07

gu/ca 1.91
ac/ug 1.91

cu/ag 1.52
ag/uc 1.52

uc/ag 1.41
ga/cu 1.41

ug/ac 1.16
ca/gu 1.16

gc/cg 0.95

cc/gg 0.27
gg/cc 0.27
cg/gc 0.00

Table IIA: Codon Table Doubly Ranked by ΔH of Leading Doublet
and Mean Codon ΔH over 16 Possible 5-Envelopes

[Table body not recoverable from the source scan. Columns: codon; bases 1-2;
ΔH of bases 1-2; mean codon ΔH over 16 possible 5-envelopes; letter designation.]

Table IIB

[Table body not recoverable from the source scan. Columns: codon; bases 2-3;
ΔH of bases 2-3; mean codon ΔH over 16 possible 5-envelopes; letter designation.]

Table IIC: Comparative Frequencies of "L" Codon Groups in Frame 0 (123...)
and "F" Codon Groups in Frame +2 (345...)
(Protein mRNA Sample 1)

Group labels are illegible in the source and are omitted. "--" marks a cell
missing from the source; "?" marks an illegible digit.

   L groups             L - F           F groups
     #        %           %            %        #
    --      12.23        0.51        11.72    19766
  22334     13.20       -0.42        13.62    2296?
    --       1.71        0.03         1.68     283?
   2565      2.07       -0.1?         1.91     3220
   9911       --        -0.23         6.09    10273
  114?7      6.75        0.22          --     11004
   7026      4.15       -0.22         4.37     7367
  10966      6.48        0.20         6.28    10583
  14880      8.79        0.12         8.67    14622
  12384      7.32       -0.19         7.51    12663
    --       3.51       -0.47         3.98     6719
   9331      5.51        0.45         5.06     8530
    --       3.60       -0.15         3.75     6327
   6355      3.76        0.15         3.61     6082
    --       7.50        0.25         7.25    12226
  13033      7.70       -0.27         7.97    13446